A method for adjusting network parameters in a multi-layer perceptron device, and perceptron device provided with means for executing the method.






The invention provides a perceptron device with an improved learning behaviour. The training is effected by alternating forward propagating steps, wherein an input vector is presented and the processing result is compared to an associated target vector, and backward propagating steps, wherein the comparison difference is used, by means of a learning rate, for updating the network parameters. Improvement is attained by two stratagems:

a) the learning rate is $\eta_j = \eta_0 \times M/(KN)$, wherein N is the number of inputs to the processing element fed by the parameter value to be updated, K is the number of outputs from that processing element and M is the number of inputs to a processing element of the next layer;

b) the learning is done in three steps: first forward propagation, then backward propagation, then again forward propagation. The improvement attained by the updating is compared to a discrimination level. If the improvement is bigger, the learning rate is decreased; if smaller, the learning rate is increased.
A METHOD FOR ADJUSTING NETWORK PARAMETERS IN A MULTI-LAYER PERCEPTRON DEVICE, AND PERCEPTRON DEVICE PROVIDED WITH MEANS FOR EXECUTING THE METHOD.



FIELD OF THE INVENTION 

The invention generally relates to adjusting network parameters of a perceptron device. A perceptron device consists of a sequence of an input layer, at least one hidden layer and an output layer.
Each layer (except the input layer) comprises a 
plurality of processing elements that are each fed 
by one or more elements of the preceding layer, in 
the limiting case by all elements of the preceding 
layer. Each interconnection between two elements 
in contiguous layers is provided with a network 
parameter which is a factor for multiplication by the 
quantity, usually an analog quantity, that is trans- 
ported along the interconnection in question. Such 
factor may be positive, negative, or zero and, in 
principle, may have an arbitrary value. For initi- 
ation, the network is trained by presenting a set of 
input vectors to the set of elements of the input 
layer. Each input vector produces a result vector at 
the outputs of the output layer, which result vector 
is compared to an associated intended target vec- 
tor. The difference between result vector and target 
vector may be calculated, for example, as a sum of 
squared differences, each difference relating to a 
particular vector component of the result vector. 
The network is adjusted by changing the network 
parameters after presentation of one or more input 
vectors. The use of a perceptron device is for 
recognition of various multi-quantity patterns, such 
as pixel patterns (e.g. representing characters), 
acoustic patterns, fingerprints and others. The 
mapping of various operations on respective hard- 
ware elements is considered a degree of freedom 
which is open to choice. 



BACKGROUND ART 

The perceptron art has been extensively de- 
scribed in various publications which are summa- 
rized in US Patent Application Serial No. 24,998 to 
Tomillinson, filed March 12, 1987, corresponding 
PCT Application WO 88/07234, herein incorporated 
by reference. This patent application describes 
continuous-time behaviour, and supplements the 
general principle explained supra by various other 
interconnection patterns both within one single layer and between successive non-contiguous layers.
Although this may in various circumstances pro- 
duce good results, the necessary hardware addi- 
tions render the system complex. 



SUMMARY OF THE INVENTION 

Among other things, it is an object of the present invention to provide, with a relatively elementary setup compared to Tomillinson's, an increased learning speed, improved learning facilities, and less susceptibility to instabilities and other undesired behaviour of the perceptron. The object is realized in that according to one of its aspects the invention provides a method for adjusting network parameters in a multi-layer perceptron device that has an initial layer of input elements, a final layer of processing elements, and a sequence of at least one hidden layer of processing elements, wherein each preceding layer produces its output quantities to feed its next successive layer under multiplication by respective parameter values, said method comprising, under presentation of pairs of a source vector at the device input and a target vector at the device output, forward propagation steps for generating a result vector and backward propagating steps wherein, under control of a difference between the result vector and the associated target vector, said respective parameter values are updated in a steepest descent method having a normalized learning rate eta, wherein an initial guess for said learning rate is:

$\eta_j = \eta_0 \times f(M,N,K)$,

wherein $\eta_0$ is an overall learning rate for the layer in question, $\eta_j$ is a learning rate for updating a particular parameter value, N is the number of inputs to the processing element fed by the parameter value in question, K is the number of outputs from that processing element, and M is the number of inputs to processing elements of the next layer, and wherein the derivative $\partial f/\partial M$ is positive, while $\partial f/\partial N$ and $\partial f/\partial K$ are negative for the actual value ranges of M, N, K.
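As an illustration only, the following minimal Python sketch evaluates the normalized learning rate under the assumption that f(M,N,K) = M/(N·K), which is the form recited in the abstract and in Claim 9. The function name and argument names are hypothetical; the connectivity numbers are those given for Figure 1 in the description.

```python
def normalized_learning_rate(eta0: float, m: int, n: int, k: int) -> float:
    """Return eta_j = eta0 * M / (N * K), an assumed form of f(M, N, K)."""
    if min(m, n, k) < 1:
        raise ValueError("connectivity numbers must be at least 1")
    return eta0 * m / (n * k)


# Hidden layer of Figure 1: N = 4 inputs, K = 2 outputs, M = 4 inputs downstream.
hidden_rate = normalized_learning_rate(1.0, m=4, n=4, k=2)
# Output layer of Figure 1: N = 4 inputs and K = M = 1.
output_rate = normalized_learning_rate(1.0, m=1, n=4, k=1)
print(hidden_rate, output_rate)  # 0.5 0.25
```

The chosen form increases with M and decreases with N and K, as the stated signs of the derivatives require.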

It has been found experimentally that the speed with which the system responds is optimized for particular values of the learning rate. Both for higher values and for lower values of the learning rate the necessary number of iteration steps increases.

A particular further aspect of the invention, however with essentially the same object, is realized by a method for adjusting network parameters in a multilevel perceptron device, said method comprising the steps of:

- loading an input vector into input elements of said perceptron and propagating any processing result in a forward propagating step until generation of a result vector;

- under control of a first difference between said result vector and an associated target vector, adjusting said network parameters in a back propagating step in accordance with a steepest descent method;

- repeating said forward propagating step with respect to the same input vector after said adjusting and calculating a second difference with respect to said target vector;

- comparing an improvement between said differences with a discrimination level and in case of a smaller improvement raising a learning rate of said steepest descent method, but in case of a larger improvement lowering said learning rate.

Also in this aspect there appears to be an optimum learning rate.
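The comparison step of this aspect can be sketched as follows. This is a minimal Python sketch, not a definitive implementation: the function name is hypothetical, the improvement is assumed to be the first difference minus the second difference, and the default factors of 1.2 and 0.8 are borrowed from the +20% and -20% amounts mentioned in the detailed description. Treating an improvement exactly equal to the discrimination level as "larger" is an arbitrary choice.

```python
def adapt_learning_rate(first_diff: float, second_diff: float, eta: float,
                        level: float, up: float = 1.2, down: float = 0.8) -> float:
    """Raise eta after a small improvement, lower it after a large one."""
    improvement = first_diff - second_diff  # assumed measure of improvement
    if improvement < level:
        return eta * up      # small improvement: the rate may be raised
    return eta * down        # large improvement: the rate is lowered


# Error dropped only from 1.00 to 0.98: improvement 0.02 is below the level 0.1.
raised = adapt_learning_rate(1.00, 0.98, eta=0.5, level=0.1)
# Error dropped from 1.00 to 0.50: improvement 0.5 is above the level 0.1.
lowered = adapt_learning_rate(1.00, 0.50, eta=0.5, level=0.1)
```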

The invention also relates to a perceptron device provided with means for adjusting its network parameters accurately and fast.

Various further advantageous aspects of the invention are recited in dependent Claims.

BRIEF DESCRIPTION OF THE FIGURES 

The invention hereinafter will be explained in 
detail with respect to the following Figures. 
Figure 1 exemplifies a three-layer perceptron; 
Figures 2 - 2a are an elementary flow chart of 
the updating stratagem; 

Figures 3a - 3c show a set of formulae explain- 
ing the optimization. 



EXEMPLARY DESCRIPTION OF A THREE-LAYER 
PERCEPTRON 

Figure 1 exemplifies a three-layer perceptron. 
It has a first layer of eight input elements 20-34, a 
second layer of processing elements 36-46 and a 
third layer of processing elements/output elements
48-52. Each input element has one input connec- 
tion and three output interconnections. In the sec- 
ond layer each element has four input interconnec- 
tions fed by associated input elements and two 
output interconnections. In the output layer each 
processing element has four input interconnections. 
There are no interconnections either within a single 
layer or between elements of layers that are not 
directly contiguous. Instead of only a single hidden 
layer, the perceptron may have a succession of 
hidden layers, still each layer only feeding the next 
successive layer. The interconnection pattern be- 
tween contiguous layers need not be uniform. In 
principle, each element could be fed by all ele- 
ments of the next preceding layer (if present). It is 
not necessary that the number of elements per 
layer decrease monotonically in the direction from 
the input layer to the output layer. Likewise, it 
could increase or be constant. The interconnection pattern as shown makes various parameter values for non-existent interconnections essentially equal to zero. Block 54 represents a training processor that has an associated memory 56 (RAM or ROM). Upon addressing via line 60 the memory presents an input vector that is transmitted via line 66 to the input elements 20-34. Each component of this vector may be analog or multi-valued digital. The parameter values are controlled via control line 68; these values may be analog, multi-valued digital, positive, zero or negative. Each processing element 36-52 executes an addition operation on all input quantities received. The final result is in this case a three-component vector, which is component by component compared to a target vector associated with the input vector. The differences are used (along line 68) for updating the parameter values. A set of input vectors would be used for training. After training, the perceptron would, for example, recognize each possible character that is fed as an input vector on inputs 58 and is processed in the same way as the training vectors. The result from outputs 48, 50, 52 would then without further processing appear at output 64. In case of character recognition, this could then be a 1-out-of-n code.
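A non-uniform interconnection pattern of this kind can be represented by a connectivity mask, from which the input and output connectivity of each element follow by counting. The Python sketch below assumes a hypothetical pattern that merely reproduces the counts stated for Figure 1 (eight input elements with three outputs each, six hidden elements with four inputs each); the actual wiring of Figure 1 is not given in the text, and all names are illustrative.

```python
def connectivity_counts(mask):
    """mask[j][i] is True when source element i feeds processing element j.

    Returns (inputs per processing element, outputs per source element).
    """
    fan_in = [sum(row) for row in mask]
    fan_out = [sum(col) for col in zip(*mask)]
    return fan_in, fan_out


def element_input(weights_row, mask_row, inputs):
    """Sum of weighted inputs; absent interconnections contribute zero."""
    return sum(w * x for w, m, x in zip(weights_row, mask_row, inputs) if m)


# Hypothetical pattern: even hidden elements see inputs 0-3, odd ones inputs 4-7.
mask = [[(i // 4) == (j % 2) for i in range(8)] for j in range(6)]
fan_in, fan_out = connectivity_counts(mask)
print(fan_in)   # [4, 4, 4, 4, 4, 4]
print(fan_out)  # [3, 3, 3, 3, 3, 3, 3, 3]
```

Skipping masked entries in `element_input` has the same effect as storing a parameter value of zero for each non-existent interconnection.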

Line 66 and inputs 58 are fed to input elements 20-34 in a controllable multiplex organization, details not shown. Applying the parameter values may be effected according to analog multiplication methods known from analog computers, or by digital multiplication for which digital multiplier chips or in-processor embedded multipliers have been widely commercialized. Processor 54 executes various standard computer operations, such as determining differences, summing of squared differences, and calculating update expressions according to elementary mathematical expressions. For brevity, no further hardware has been detailed.

DESCRIPTION OF AN OPTIMIZATION 

Figure 2 is an elementary flow chart of a preferred updating stratagem. Block 80 represents initialization of the system, for example by loading a set of training input vectors and respective associated target vectors, and resetting the address counter of processor 54 and other representative registers. Also a preliminary set of network parameters is entered. In block 82 the first input vector is presented along line 66, after which all perceptron elements process their input data according to their intended function. In an initial setting, all parameter values could be equal to one, all elements (except those of the first layer) doing a straight analog add operation. Alternatively, the parameter values may be initialized in an arbitrary way, for example, in that they are set manually to a uniform value that gives convergence. When a new training session is to be started the old manual set is automatically retrieved. The uniform value of course depends on the network configuration. Now, the network as a whole operates synchronously, that is, the difference between result vector and target vector is finalized after all signals have settled. The various elements may operate either asynchronously with respect to each other, or under adequate clock control, so that each element only starts its operation after all input signals thereto have settled. This second mode of operation may be more suitable in case the input values are multi-valued digital. In block 84, between blocks 80 and 82, a stop criterion is applied. This may consider whether all input vectors of the intended set for training have been presented a sufficient number of times. This number may be a preset number with respect to the set overall. Another stratagem is to calculate the summed difference for all result vectors, or to count the number of differences that are larger than a discrimination threshold and to stop when this count, after successive decrementing, reaches another lower threshold. If the answer to application of the stop criterion is positive, the system exits to block 86 while signalling a -ready- condition to a user.

If the stop criterion did not become effective, after the forward propagation in block 82, the system goes to block 88. Herein, an update correction is applied to all parameter values by means of the well-known method of error back propagation according to the steepest descent method. Now, the MLP (multilayer perceptron) model is such that an element belonging to a particular layer sums up outputs of a previous layer, and propagates this to the next layer after applying a known Sigmoid function. That is, output $o_j$ of a node j is given by

$o_j = f(\mathrm{net}_j) = \frac{1}{1 + e^{-\mathrm{net}_j}}$,

$\mathrm{net}_j = \sum_i w_{ji} o_i + \theta_j$,

where $o_i$ is an output value of the previous layer and $\theta_j$ is a bias value of the particular node itself. In error back propagation every parameter value is adjusted to minimize the energy that is defined as the summation of the squared output errors between actual output $o_{pj}$ and the target output $t_{pj}$ which corresponds to input vector (pattern) p. The following expression gives the updating rule for the parameter values (or weights):

$\Delta w_{ji}(n+1) = \eta_0 \delta_j o_i + \alpha \Delta w_{ji}(n)$
Herein, the symbols have the following meanings: i, j, k are the indices of the elements of the successive layers of elements, with i pertaining to the input layer, j pertaining to the hidden layer, and k pertaining to the output layer. For the moment, a single hidden layer is presumed. So, quantity $w_{ji}$ is the multiplicative parameter value in the connection from input element number i to hidden layer element number j. Likewise, $w_{kj}$ relates to the interconnection from hidden layer element j to output layer element k. The quantity n is the iterative step number with respect to the application of the input vector in question. For each first application of the input vector in question, this quantity equals zero. The factor $\eta_0$ is a learning rate quantity. $\delta_j$ is a generalized difference quantity which for the output layer has been defined as follows:

$\delta_j = (t_{pj} - o_{pj}) f'(\mathrm{net}_j)$

Here, t is the target value, o is the result value. For the hidden layer, the generalized difference is

$\delta_j = \left(\sum_{k=1}^{K} \delta_k w_{kj}\right) f'(\mathrm{net}_j)$

Quantity $\alpha$ is the so-called momentum rate quantity. This reflects the correction effected in the most recent presentation of an input vector. Naturally, in the first updating operation, this second term is equal to zero. The choosing of the value of the learning rate $\eta_0$ is discussed hereinafter. After execution of the correction step in block 88 (back propagation), the same input vector is presented again in block 90 (the value of n is not incremented, however) and the maximum difference for each processing element between the input values from the net before and after the (most recent) adjustment of the weight parameters is checked for quantifying the improvement reached by the back propagation in block 88. In block 92 the value of $\eta_0$ is adjusted for optimum speed of the approximation process as will be explained hereinafter. Thereafter, the system reverts to block 84. It was found that the choices of $\eta_0$ and $\alpha$ are crucial. A reasonable value of $\alpha$ can be 0.9, although the optimum value can be subject to some indifference. The value of $\eta_0$ should be adapted to the number of connections. One of the problems of the method is the convergence of the iteration. Notably, initial values must be chosen correctly, and care must be taken to not become trapped in a local optimum point which differs from the overall optimum point.
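The node output, the generalized differences and the weight update with momentum described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the Sigmoid derivative is written as o(1 - o), and updating of the bias value is left out because the text does not describe it.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def node_output(weights, inputs, theta):
    """Return (o_j, net_j) with net_j = sum_i w_ji * o_i + theta_j."""
    net = sum(w * o for w, o in zip(weights, inputs)) + theta
    return sigmoid(net), net


def output_delta(target: float, out: float) -> float:
    """delta_j = (t_j - o_j) * f'(net_j), using f' = o * (1 - o) for the Sigmoid."""
    return (target - out) * out * (1.0 - out)


def hidden_delta(out: float, deltas_next, weights_next) -> float:
    """delta_j = (sum_k delta_k * w_kj) * f'(net_j)."""
    return sum(d * w for d, w in zip(deltas_next, weights_next)) * out * (1.0 - out)


def weight_steps(eta, delta, inputs, alpha, previous_steps):
    """Weight change per input: eta * delta_j * o_i + alpha * previous change."""
    return [eta * delta * o + alpha * prev
            for o, prev in zip(inputs, previous_steps)]


inputs = [1.0, 0.0]
out, net = node_output([0.0, 0.0], inputs, theta=0.0)   # out = 0.5
delta = output_delta(target=1.0, out=out)               # 0.5 * 0.5 * 0.5
steps = weight_steps(1.0, delta, inputs, alpha=0.9, previous_steps=[0.0, 0.0])
print(steps)  # [0.125, 0.0]
```

In the first update the momentum term vanishes because the previous changes are zero, in line with the remark above.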

Figure 1 shows how the learning rate is adapted as based on the connectivity numbers N, M, K, which are shown with respect to a processing element of the hidden layer. For the output layer, N is defined as the input connectivity of that layer, while K = M = 1. With a standard value $\eta_{00}$ independent of actual connectivity, the learning rate $\eta_{0j}$ is adapted according to:

$\eta_{0j} = \eta_{00} \cdot \frac{M}{NK}$

Therefore, in the example of Fig. 1, for the hidden layer, the factor M/NK = 4/8 = 1/2; for the output layer, the factor M/NK = 1/N = 1/4 (there N is the number of inputs to the output element in question). If not all elements of the output layer have the same input connectivity, a weighting operation on these values is effected. Usually, however, the elements of any particular layer have uniform input connectivity and output connectivity. The background of the above adaptation rule is the following. Every element has the same components, such as addition ($\Sigma$) and threshold function (f). Normally, the Sigmoid function is used as f because of its differentiability and its normalization between 0 and 1. To simplify, assuming $\alpha = 0$, the difference to be updated becomes

$\Delta w_{ji} = \eta_0 \delta_j o_i$ (3)

After every weight is updated using the above relation, again using the same input vector, the actual difference of summation inputs ($\Delta\mathrm{net}_j$) is the following:

$\Delta\mathrm{net}_j = \eta_0 \delta_j \sum_i o_i^2$ (4)

Conversely, the desired $\eta_0$ can be calculated from this. It is noted that

$\sum_{k=1}^{K} \delta_k w_{kj}$

means that each $\delta_k$ is multiplied by $w_{kj}$, and is added K times. Moreover, $w_{kj}$ is one of the weights used as a value on the connection to the next element which has M inputs. Thus the above guess ($\eta_0 = 1$) shows a good result. But to be more precise, the values $\Delta\mathrm{net}_j$, $\delta_j$, $w_{kj}$, $o_i$ should be taken into account. These values are to be changed during calculations. The preferable method is to change $\eta_0$ adaptively.

Now, in Figure 2 the second forward propagation allows determining whether the maximum absolute difference of input ($|\Delta\mathrm{net}|$) is above the given criterion or not. If the difference is smaller than the criterion value (in this case, 0.1 is taken), $\eta_0$ is increased. Then finally, as an overall learning rate, the L-th root of the value so found is taken. Herein, L is the total number of processing elements in the network. The new value of $\eta_0$ is used in the next forward propagation step. Experimentally the following behaviour for the learning rate was found best to use. The starting value is small. Thereafter it moves to a somewhat larger value. Various experiments have shown intermediate stabilization at $\eta_0 \approx 1$. In the final stages of learning the learning rate got larger. It is suggested that starting with a rather small learning rate is a particularly advantageous aspect of the invention, inasmuch as in general it is not known which value should be optimum.

The reason for taking the L-th root is for normalizing the growth of the learning rate, where the contribution of each processing element would be a multiplication factor. Note that

$\eta_j = \eta_0 \cdot f(M, N, K)$,

wherein only the value of $\eta_0$ is adjusted. More precisely, the method followed is depicted in Fig. 2a. The legends of the various blocks are:
100: $K_0 = 1$;
102: iterate in terms of a particular network layer;
104: search the absolute maximum value among this particular layer;
106: if this absolute maximum is smaller than a preassigned threshold, then increase $K_0$ by a predetermined amount; if larger, then decrease $K_0$ by a predetermined amount. In various experiments, the amounts were +20% and -20%, but other, and also unsymmetric, values are useful as well.
After termination of this iteration, 108: take the L-th root of $K_0$, and calculate the new learning rate $\eta_{n+1} = K_0 \cdot \eta_n$. Instead of searching for the absolute maximum in block 104, every particular processing element could also be interrogated cyclically and its contribution used directly in block 106.
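The adaptive adjustment of the overall learning rate by blocks 100 to 108 of Fig. 2a might be sketched as follows. This Python sketch rests on several assumptions that the text leaves open: block 100 is read as initializing the factor to one, the new rate is taken as the old rate times the L-th root of that factor, L counts all processing elements supplied, and each layer is non-empty. The function name and the data layout (one list of net-input changes per layer) are hypothetical.

```python
def adjust_overall_rate(eta: float, delta_net_by_layer,
                        threshold: float = 0.1,
                        up: float = 1.2, down: float = 0.8) -> float:
    """Return the new overall rate after one pass over all layers."""
    k0 = 1.0                                               # block 100
    total_elements = sum(len(layer) for layer in delta_net_by_layer)
    for layer in delta_net_by_layer:                       # block 102
        max_abs = max(abs(d) for d in layer)               # block 104
        k0 *= up if max_abs < threshold else down          # block 106
    return eta * k0 ** (1.0 / total_elements)              # block 108


# Two hidden elements with small changes, one output element with a large one.
new_eta = adjust_overall_rate(1.0, [[0.05, 0.02], [0.3]])
```

With the example data the factor becomes 1.2 x 0.8 = 0.96 and L = 3, so the rate shrinks slightly; the root keeps a single pass from changing the rate by more than the largest individual factor.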



DESCRIPTION OF THE ALGORITHM USED 

Figures 3a-3c show a set of formulae explaining the optimization. First, the following set of parameters is defined:
$n_p$ - number of prototype patterns (input vectors);
$n_j$ - number of elements in (last) hidden layer; in the example only one hidden layer is present;
$n_k$ - number of output elements;
$y_{pk}$ - the output value of output element k for input pattern p;
$t_{pk}$ - the target value for output element k for input pattern p;
f - the transfer function of a processing element, such as $1/(1 + e^{-x})$, x being the input value. Hereabove the Sigmoid function was stated;
$\mathrm{net}_{pk}$ - the total input to processing element k upon presentation of input vector p.

BASIC ERROR BACK PROPAGATION 

Now the error function E of a multilayer perceptron is defined according to Figure 3a. In the third line the error function is further factorized. Now, this function must be minimized by adjusting the values of $w_{ji}$. Error back propagation is a steepest descent method, where the weights are changed in the direction opposite to the gradient. Given the particular multilayer architecture, the partial derivatives of E with respect to the weights can be calculated recursively, on a layer by layer basis.

For weights $w_{kj}$ connecting elements j of the (last) hidden layer to elements k of the output layer, partial differentiation with respect to $w_{kj}$ for a given input pattern p (dropping the subscript p for clarity's sake) gives the expressions of Figure 3b.

Differentiation of the expressions of Figure 3a, third line, with respect to the weight factors at the input side of the (final) hidden layer that immediately precedes the output layer yields the expressions of Figure 3c.
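Since the formulae of Figures 3a and 3b are not reproduced in the text, the following Python sketch only illustrates the kind of derivative involved, under assumptions: the error for one pattern and one output element is taken as E = (t - y)^2 / 2 (the factor 1/2 is a common convention, not confirmed by the source), the bias is omitted, and all names are hypothetical. The analytic derivative with respect to an output-layer weight is compared against a central finite difference.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def output_error(weights, hidden_out, target):
    """Assumed error E = 0.5 * (t - y)^2 for one output element."""
    y = sigmoid(sum(w * o for w, o in zip(weights, hidden_out)))
    return 0.5 * (target - y) ** 2


def output_gradient(weights, hidden_out, target):
    """dE/dw_kj = -(t_k - y_k) * f'(net_k) * o_j, with f' = y * (1 - y)."""
    y = sigmoid(sum(w * o for w, o in zip(weights, hidden_out)))
    return [-(target - y) * y * (1.0 - y) * o for o in hidden_out]


weights, hidden_out, target = [0.3, -0.2], [0.6, 0.9], 1.0
analytic = output_gradient(weights, hidden_out, target)

h = 1e-6  # step of the central finite difference
numeric = []
for j in range(len(weights)):
    plus = [w + h if i == j else w for i, w in enumerate(weights)]
    minus = [w - h if i == j else w for i, w in enumerate(weights)]
    numeric.append((output_error(plus, hidden_out, target)
                    - output_error(minus, hidden_out, target)) / (2 * h))
```

Steepest descent then changes each weight by the negative of this derivative times the learning rate, which gives the product of learning rate, generalized difference and input value used in the updating rule.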

Now, it has been found that the convergence of the method according to prior art is rather slow. In remedy thereto, the following stratagems are proposed:

a) first, for each interconnection step, i.e. for the interconnections between the input layer and the hidden layer and for the interconnections between the hidden layer and the output layer, a standard value of eta is chosen: $\eta_0$. This value usually is substantially equal to one. The connectivity numbers N, M, K have been defined earlier. Thus, in the situation of Figure 1, with respect to the single hidden layer: N = 4, M = 4, K = 2. Now, for a particular updating operation, $\eta_j = \eta_0 M/(NK) = \eta_0/2$.

In a particular experiment, the following three 
network set-ups were used: 

a) two input elements would each feed two 
elements in the hidden layer; the latter would 
both feed one output element: XOR2 

b) two input elements would each feed eight 
elements in the hidden layer; all of the latter 
would feed the single output element: XOR8 

c) the input elements would each feed 32 ele- 
ments in the hidden layer; all of the latter would 
feed the single output element: XOR32. 

These networks were used to train for recognition of various pictures. For example, with the latter arrangement the output learning rate was chosen equal to 0.3; variation of the learning rate with respect to the intermediate layer from 0.3 to 55 improved (lowered) the number of iterations by a factor of six.

In another experiment a picture of 10x10 pixels was used as object, with the intermediate layer also consisting of 10x10 processing elements. For a learning rate of the output layer of 0.1, the learning rate of the intermediate layer was varied from 1 to 1000. This improved the number of iterations by a factor of 10-20. The normalization by means of the quantity L allowed for using the high values for the learning rate $\eta_0$.

A second speed-up is effected in block 92 of Figure 2. Herein, the quantity $|\Delta\mathrm{net}_{max}|$, with respect to any of the input vectors presented during one round along all input vectors, is compared with a preset discrimination level. If this quantity is lower, this means that the risk of instability in the convergence is low and the convergence speed can be increased: $\eta_0$ is increased, for example by +20%. If the quantity however is higher, the convergence speed may better be decreased, for example by changing $\eta_0$ by -20%. Of course, other fractional changes could apply as well, such as ±10%, ±30% or others within a reasonable interval. The positive percentage need not be equal to the negative percentage. In a preferred mode, the discrimination level is at 0.1. However, other values could be just as advantageous, such as 0.08, 0.15 and others. The improved speed was found experimentally by a trial and error method.

Claims 

1. A method for adjusting network parameters in a multi-layer perceptron device that has an initial layer of input elements, a final layer of processing output elements, and a sequence of at least one hidden layer of processing elements, wherein each preceding layer produces its output quantities to feed its next successive layer under multiplication by respective parameter values, said method comprising under presentation of pairs of a source vector at the device input and a target vector at the device output of forward propagation steps for generating a result vector and backward propagating steps wherein under control of a difference between result vector and the associated target vector said respective parameter values are updated in a steepest descent method having a normalized learning rate eta, wherein an initial guess for said learning rate is:

$\eta_j = \eta_0 \times f(M,N,K)$,

wherein $\eta_0$ is an overall learning rate for the layer in question, $\eta_j$ is a learning rate for updating a particular parameter value, N is the number of inputs to the processing element fed by the parameter value in question, K is the number of outputs from that processing element, and M is the number of inputs to processing elements of the next layer, and wherein the derivative $\partial f/\partial M$ is positive, while $\partial f/\partial N$ and $\partial f/\partial K$ are negative for the actual value ranges of M, N, K.

2. A method as claimed in Claim 1, wherein for the layer of output elements

$\eta_j = \eta_0 \times f(M)$,

N, K having a standard value of 1, and $\partial f/\partial M$ being positive.

3. A method as claimed in Claim 1 or 2, wherein for any particular layer the function f(M,N,K) has a uniform value.

4. A method as claimed in Claim 1, wherein f(M,N,K) is substantially proportional to M.

5. A method as claimed in Claim 1, wherein f(M,N,K) is substantially inversely proportional to N.

6. A method as claimed in Claim 1, wherein f(M,N,K) is substantially inversely proportional to K.

7. A method as claimed in any of Claims 1 to 6, wherein for the whole network $\eta_0$ has a uniform value.

8. A method for adjusting network parameters in a 
multilevel perceptron device, said method compris- 
ing the steps of: 

- loading an input vector into input elements of said 
perceptron and propagating any processing result 
in a forward propagating step until generation of a 
result vector; 

- under control of a first difference between said 
result vector and an associated target vector in 
accordance to a steepest descent method adjusting 
said network parameters in a back propagating 
step; 

- repeating said forward propagating step with re- 
spect to the same input vector after said adjusting 
and calculating a second difference with respect to 
said target vector; 

- comparing an improvement between said differ- 
ences with a discrimination level and in case of a 
smaller improvement raising a learning rate of said 
steepest descent method but under control of a 
larger improvement lowering said basic parameter. 

9. A multilayer perceptron device having input means for receiving an input vector, a plurality of processing layers of processing elements including at least one hidden layer of processing elements, comparison means for in a forward propagation step comparing said processing input vector to an associated target vector, and generating a feedback control signal for updating respective network parameters in a backward propagating step under control of a learning rate; and repeat means for reactivating said input means, said processing layers, and said comparison means with respect to the same input vector; and sequencing means for thereafter presenting a next input vector for processing in like manner as its preceding input vector, wherein said learning rate is

$\eta_j = \eta_0 \times M/(KN)$,

wherein N is the number of inputs to the process- 
ing element fed by the updateable parameter value 
in question, K is the number of outputs from that 
processing element, and M is the number of inputs 
to the processing elements of the next layer. 

10. A device as claimed in Claim 9, wherein for the 
output layer M = K = 1. 

11. A multilayer perceptron device having input 
means for receiving an input vector, a plurality of 
processing layers of processing elements including 
at least one hidden layer of processing elements, 
first comparison means for in a forward propagation 
step comparing the processing input vector to an 
associated target vector, and generating a feedback control signal for updating respective network
parameters in a backward propagating step under 
control of a learning rate parameter, and repeat 
means, for reactivating said input means, said pro- 
cessing layers, and said comparison means with 
respect to the same input vector; and sequencing 
means for thereafter presenting a next input vector 
for processing in like manner as its preceding input 
vector, and having second comparison means for 
comparing two comparison results thus attained for 
the same input vector as separated by said updat- 
ing, for comparing an improvement thus attained 
with a predetermined discrimination level, in case 
of a smaller improvement raising said learning rate, 
and in case of a bigger improvement decreasing 
said learning rate. 